618 research outputs found
Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes
High-performance computing systems are moving towards 2.5D and 3D memory hierarchies, based on High Bandwidth Memory (HBM) and Hybrid Memory Cube (HMC), to mitigate the main memory bottlenecks. This trend is also creating new opportunities to revisit near-memory computation. In this paper, we propose a flexible processor-in-memory (PIM) solution for scalable and energy-efficient execution of deep convolutional networks (ConvNets), one of the fastest-growing workloads for servers and high-end embedded systems. Our co-design approach consists of a network of Smart Memory Cubes (modular extensions to the standard HMC), each augmented with a many-core PIM platform called NeuroCluster. NeuroClusters have a modular design based on NeuroStream coprocessors (for convolution-intensive computations) and general-purpose RISC-V cores. In addition, a DRAM-friendly tiling mechanism and a scalable computation paradigm are presented to efficiently harness this computational capability with a very low programming effort. NeuroCluster occupies only 8 percent of the total logic-base (LoB) die area in a standard HMC and achieves an average performance of 240 GFLOPS for complete execution of full-featured state-of-the-art (SoA) ConvNets within a power budget of 2.5 W. Overall, 11 W is consumed in a single SMC device, with 22.5 GFLOPS/W energy efficiency, which is 3.5X better than the best GPU implementations in similar technologies. The minor increase in system-level power and the negligible area increase make our PIM system a cost-effective and energy-efficient solution, easily scalable to 955 GFLOPS with a small network of just four SMCs.
Azarkhish, Erfan*; Rossi, Davide; Loi, Igor; Benini, Luca
A Hybrid Instruction Prefetching Mechanism for Ultra Low-Power Multicore Clusters
The instruction memory hierarchy plays a critical role in performance and energy efficiency of ultralow-power (ULP) processors for the Internet-of-Things (IoT) end-nodes. This is mainly due to the extremely tight power envelope and area budgets, which imply small instruction-caches (I-Cache) operating at very low supply voltages (near-threshold). The challenge is aggravated by the fact that multiple processors, fetching in parallel, require plenty of bandwidth from the I-Caches. In this letter, we propose a low-cost and energy-efficient hybrid instruction-prefetching mechanism to be integrated with a ULP multicore cluster. We study its performance for a wide range of IoT applications, from cryptography to computer vision, and show that it can effectively improve the hit-rate of almost all of them to above 95% (average performance improvement of over 2×). In addition, we designed our prefetcher and integrated it in a 4-core cluster in 28 nm fully-depleted silicon-on-insulator (FDSOI) technology. We show that the system's power consumption increases only by about 11% and silicon area by less than 1%. Altogether, a total energy reduction of 1.9× is achieved, thanks to more than 2× performance improvement, enabling a significantly longer battery life.
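The energy figure in this abstract follows from energy = power × time. The short Python sketch below works backwards from the two reported numbers (about 11% more power, 1.9× less energy) to the speedup they imply. It assumes a fixed workload and that the 11% power increase applies over the whole run; the function and variable names are illustrative and not taken from the paper.

```python
def energy_ratio(power_ratio: float, speedup: float) -> float:
    """Energy with the prefetcher divided by energy without it.

    Assumes a fixed workload, so run time scales as 1 / speedup and
    energy = power * time.
    """
    return power_ratio / speedup


# Figures quoted in the abstract.
power_ratio = 1.11        # "increases only by about 11%"
energy_reduction = 1.9    # "total energy reduction of 1.9x"

# Speedup needed for the two figures to be mutually consistent.
implied_speedup = power_ratio * energy_reduction

print(f"implied speedup: {implied_speedup:.2f}x")
print(f"energy ratio at that speedup: {energy_ratio(power_ratio, implied_speedup):.3f}")
```

The implied speedup comes out slightly above 2×, which agrees with the abstract's "more than 2× performance improvement".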
Scalable Hierarchical Instruction Cache for Ultra-Low-Power Processors Clusters
High Performance and Energy Efficiency are critical requirements for Internet
of Things (IoT) end-nodes. Exploiting tightly-coupled clusters of programmable
processors (CMPs) has recently emerged as a suitable solution to address this
challenge. One of the main bottlenecks limiting the performance and energy
efficiency of these systems is the instruction cache architecture due to its
criticality in terms of timing (i.e., maximum operating frequency), bandwidth,
and power. We propose a hierarchical instruction cache tailored to
ultra-low-power tightly-coupled processor clusters where a relatively large
cache (L1.5) is shared by L1 private caches through a two-cycle latency
interconnect. To address the performance loss caused by the L1 capacity misses,
we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to
L1.5. We optimize the core instruction fetch (IF) stage by removing the
critical core-to-L1 combinational path. We present a detailed comparison of
instruction cache architectures' performance and energy efficiency for parallel
ultra-low-power (ULP) clusters. Focusing on the implementation, our two-level
instruction cache provides better scalability than existing shared caches,
delivering up to 20% higher operating frequency. On average, the proposed
two-level cache improves maximum performance by up to 17% compared to the
state-of-the-art while delivering similar energy efficiency for most relevant
applications. Comment: 14 pages
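To illustrate the idea of a next-line prefetcher with probe filtering, here is a toy Python model. It is a sketch under loud assumptions, not the paper's design: the private L1 is modelled as a small fully associative LRU cache of line addresses, every L1 miss or prefetch is counted as one request to the shared L1.5, and "probe filtering" is taken to mean that a prefetch for line N+1 is dropped when a probe finds N+1 already in L1. All class and function names are hypothetical.

```python
from collections import OrderedDict


class L1ICache:
    """Toy fully associative LRU cache holding instruction line addresses."""

    def __init__(self, n_lines: int) -> None:
        self.n_lines = n_lines
        self.lines: "OrderedDict[int, bool]" = OrderedDict()

    def probe(self, line: int) -> bool:
        """Check presence without changing the LRU order."""
        return line in self.lines

    def touch(self, line: int) -> None:
        """Mark a resident line as most recently used."""
        self.lines.move_to_end(line)

    def fill(self, line: int) -> None:
        """Insert a line, evicting the least recently used one if full."""
        self.lines[line] = True
        self.lines.move_to_end(line)
        if len(self.lines) > self.n_lines:
            self.lines.popitem(last=False)


def simulate(trace, n_lines: int, prefetch: bool):
    """Return (L1 hits, requests sent to the shared L1.5) for a fetch trace."""
    l1 = L1ICache(n_lines)
    hits = 0
    l15_requests = 0
    for line in trace:
        if l1.probe(line):
            hits += 1
            l1.touch(line)
        else:
            l15_requests += 1          # demand miss goes to L1.5
            l1.fill(line)
        # Next-line prefetch, filtered by probing L1 first.
        if prefetch and not l1.probe(line + 1):
            l15_requests += 1
            l1.fill(line + 1)
    return hits, l15_requests


straight_line = list(range(64))        # code with no reuse
small_loop = [0, 1, 2, 3] * 10         # loop that fits in L1

print(simulate(straight_line, 8, prefetch=False))
print(simulate(straight_line, 8, prefetch=True))
print(simulate(small_loop, 8, prefetch=True))
```

In this toy model, prefetching turns almost every fetch of straight-line code into a hit, while the probe keeps a loop that already fits in L1 from sending repeated prefetch requests to the shared level.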
An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics
Near-sensor data analytics is a promising direction for IoT endpoints, as it
minimizes energy spent on communication and reduces network load - but it also
poses security concerns, as valuable data is stored or sent over the network at
various stages of the analytics pipeline. Using encryption to protect sensitive
data at the boundary of the on-chip analytics engine is a way to address data
security issues. To cope with the combined workload of analytics and encryption
in a tight power envelope, we propose Fulmine, a System-on-Chip based on a
tightly-coupled multi-core cluster augmented with specialized blocks for
compute-intensive data processing and encryption functions, supporting software
programmability for regular computing tasks. The Fulmine SoC, fabricated in
65nm technology, consumes less than 20mW on average at 0.8V achieving an
efficiency of up to 70pJ/B in encryption, 50pJ/px in convolution, or up to
25MIPS/mW in software. As a strong argument for real-life flexible application
of our platform, we show experimental results for three secure analytics use
cases: secure autonomous aerial surveillance with a state-of-the-art deep CNN
consuming 3.16pJ per equivalent RISC op; local CNN-based face detection with
secured remote recognition in 5.74pJ/op; and seizure detection with encrypted
data collection from EEG within 12.7pJ/op. Comment: 15 pages, 12 figures, accepted for publication to the IEEE
Transactions on Circuits and Systems - I: Regular Papers
A -1.8V to 0.9V body bias, 60 GOPS/W 4-core cluster in low-power 28nm UTBB FD-SOI technology
A 4-core cluster fabricated in low-power 28nm UTBB FD-SOI conventional-well technology is presented. The SoC architecture enables the processors to operate 'on-demand' over a wide supply voltage range of 0.44V (1.8MHz) to 1.2V (475MHz) and a wide body bias range of -1.8V to 0.9V, achieving a peak energy efficiency of 60 GOPS/W (419μW, 6.4MHz) at 0.5V with 0.5V forward body bias. The proposed SoC's energy efficiency is 1.4x to 3.7x greater than that of other low-power processors with comparable performance.
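As a sanity check on the peak-efficiency operating point, the Python sketch below converts the quoted figures into operations per clock cycle. It assumes the 60 GOPS/W figure is measured at the 419 μW, 6.4 MHz point given in parentheses and that operations are spread evenly over the four cores; the variable names are illustrative.

```python
# Figures quoted in the abstract.
gops_per_watt = 60.0      # peak energy efficiency
power_w = 419e-6          # 419 microwatts
freq_hz = 6.4e6           # 6.4 MHz
cores = 4

# Throughput implied by efficiency * power, in operations per second.
throughput_ops = gops_per_watt * 1e9 * power_w

# Operations per clock cycle for the whole cluster and per core.
ops_per_cycle = throughput_ops / freq_hz
ops_per_cycle_per_core = ops_per_cycle / cores

print(f"throughput: {throughput_ops / 1e6:.2f} MOPS")
print(f"cluster ops/cycle: {ops_per_cycle:.2f}")
print(f"per-core ops/cycle: {ops_per_cycle_per_core:.2f}")
```

Under these assumptions each core sustains close to one operation per cycle, so the quoted efficiency, power, and frequency are mutually consistent.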
Characterization and Implementation of Fault-Tolerant Vertical Links for 3-D Networks-on-Chip
Through-silicon vias (TSVs) provide an efficient way to support vertical communication among different layers of a vertically stacked chip, enabling scalable 3-D networks-on-chip (NoC) architectures. Unfortunately, low TSV yields significantly impact the feasibility of high-bandwidth vertical connectivity. In this paper, we present a semi-automated design flow for 3-D NoCs including a defect-tolerance scheme to increase the global yield of 3-D stacked chips. Starting from an accurate physical and geometrical model of TSVs: 1) we extract a circuit-level model for vertical interconnections; 2) we use it to evaluate the design implications of extending switch architectures with ports in the vertical direction; moreover, 3) we present a defect-tolerance technique for TSV-based multi-bit links through an effective use of redundancy; and finally, 4) we present a design flow allowing for post-layout simulation of NoCs with links in all three physical dimensions. Experimental results show that a 3-D NoC implementation yields around 10% frequency improvement over a 2-D one, thanks to the propagation delay advantage of TSVs and the shorter links. In addition, the adopted fault tolerance scheme demonstrates a significant yield improvement, ranging from 66% to 98%, with a low area cost (20.9% on a vertical link in a NoC switch, which leads to a modest 2.1% increase in the total switch area) in 130 nm technology, with minimal impact on very large-scale integrated design and test flows.
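The effect of redundancy on link yield can be illustrated with a simple binomial model in Python. This is a generic sketch, not the paper's defect-tolerance scheme: it assumes TSVs fail independently with the same probability and that a link with spare TSVs works whenever the number of failed TSVs does not exceed the number of spares. The function name and the example parameters are hypothetical.

```python
from math import comb


def link_yield(n_signal: int, n_spare: int, tsv_yield: float) -> float:
    """Probability that a multi-bit vertical link is usable.

    Assumes independent TSV failures and that any spare can replace any
    failed TSV, so the link survives up to n_spare failures.
    """
    total = n_signal + n_spare
    p_fail = 1.0 - tsv_yield
    return sum(
        comb(total, k) * p_fail**k * tsv_yield ** (total - k)
        for k in range(n_spare + 1)
    )


# Hypothetical example: a 32-bit link with a per-TSV yield of 99%.
for spares in (0, 1, 2):
    print(spares, round(link_yield(32, spares, 0.99), 4))
```

With no spares the link yield is simply the per-TSV yield raised to the number of signals, which drops quickly for wide links; even one or two spares recover most of the loss in this model.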
High Performance Ambipolar Field-Effect Transistor of Random Network Carbon Nanotubes
Ambipolar field-effect transistors of random network carbon nanotubes are fabricated from an enriched dispersion utilizing a conjugated polymer as the selective purifying medium. The devices exhibit high mobility values for both holes and electrons (3 cm²/V·s) with a high on/off ratio (10⁶). The performance demonstrates the effectiveness of this process to purify semiconducting nanotubes and to remove the residual polymer.
Overall Survival With Palbociclib And Fulvestrant in Women With HR+/HER2– ABC: Updated Exploratory Analyses of PALOMA-3, a Double-Blind, Phase 3 Randomized Study
Purpose: To conduct an updated exploratory analysis of overall survival (OS) with a longer
median follow-up of 73.3 months and evaluate the prognostic value of molecular analysis by
circulating tumor DNA (ctDNA).
Patients and methods: Patients with hormone receptor−positive/human epidermal growth
factor receptor 2−negative (HR+/HER2−) advanced breast cancer (ABC) were randomized 2:1 to
receive palbociclib (125 mg orally/d; 3/1 week schedule) and fulvestrant (500 mg
intramuscularly) or placebo and fulvestrant. This OS analysis was performed when 75% of
enrolled patients died (393 events in 521 randomized patients). ctDNA analysis was performed
among patients who provided consent.
Results: At the data cutoff (August 17, 2020), 258 and 135 deaths occurred in the palbociclib
and placebo groups, respectively. The median OS (95% CI) was 34.8 months (28.8−39.9) in the
palbociclib group and 28.0 months (23.5−33.8) in the placebo group (stratified hazard ratio
0.81; 95% CI, 0.65−0.99). The 6-year OS rate (95% CI) was 19.1% (14.9−23.7) and 12.9%
(8.0−19.1) in the palbociclib and placebo groups, respectively. Favorable OS with palbociclib
plus fulvestrant compared with placebo plus fulvestrant was observed in most subgroups,
particularly in patients with endocrine-sensitive disease, no prior chemotherapy for ABC, low
circulating tumor fraction, and regardless of ESR1, PIK3CA, or TP53 mutation status. No new
safety signals were identified.
Conclusions: The clinically meaningful improvement in OS associated with palbociclib plus
fulvestrant was maintained with >6 years of follow-up in patients with HR+/HER2− ABC,
supporting palbociclib plus fulvestrant as a standard of care in these patients.
Trial Registration: ClinicalTrials.gov Identifier: NCT0194213
Overall Survival with Palbociclib and Fulvestrant in Women with HR+/HER2− ABC: Updated Exploratory Analyses of PALOMA-3, a Double-blind, Phase III Randomized Study
Purpose: To conduct an updated exploratory analysis of overall
survival (OS) with a longer median follow-up of 73.3 months and
evaluate the prognostic value of molecular analysis by circulating
tumor DNA (ctDNA).
Patients and Methods: Patients with hormone receptor–positive/
human epidermal growth factor receptor 2–negative (HR+/HER2−)
advanced breast cancer (ABC) were randomized 2:1 to receive
palbociclib (125 mg orally/day; 3/1 week schedule) and fulvestrant
(500 mg intramuscularly) or placebo and fulvestrant. This OS
analysis was performed when 75% of enrolled patients died (393
events in 521 randomized patients). ctDNA analysis was performed
among patients who provided consent.
Results: At the data cutoff (August 17, 2020), 258 and 135 deaths
occurred in the palbociclib and placebo groups, respectively.
The median OS [95% confidence interval (CI)] was 34.8 months
(28.8–39.9) in the palbociclib group and 28.0 months (23.5–33.8)
in the placebo group (stratified hazard ratio, 0.81; 95% CI,
0.65–0.99). The 6-year OS rate (95% CI) was 19.1% (14.9–23.7) and
12.9% (8.0–19.1) in the palbociclib and placebo groups, respectively.
Favorable OS with palbociclib plus fulvestrant compared
with placebo plus fulvestrant was observed in most subgroups,
particularly in patients with endocrine-sensitive disease, no prior
chemotherapy for ABC and low circulating tumor fraction and
regardless of ESR1, PIK3CA, or TP53 mutation status. No new
safety signals were identified.
Conclusions: The clinically meaningful improvement in OS
associated with palbociclib plus fulvestrant was maintained with
>6 years of follow-up in patients with HR+/HER2− ABC, supporting
palbociclib plus fulvestrant as a standard of care in these patients.